13-1 Speaker Recognition

The speech signal contains both the message being spoken and information about the speaker. Therefore we can use the speech signal for both speech recognition and speaker recognition. More specifically, speaker recognition can be further divided into two tasks:
  1. Speaker identification: To determine which one of a group of known voices best matches the input voice sample.
  2. Speaker verification: To determine from a voice sample whether a person is who they claim to be.
For both of the above tasks, the utterances can be of two types:
  1. Text dependent: The utterances can only be from a finite set of sentences.
  2. Text independent: The utterances are totally unconstrained.
In general, speaker identification/verification with text-dependent utterances achieves a higher recognition rate.

As reported in the literature, the Gaussian mixture model (GMM) is one of the most effective methods for speaker identification/verification. The probability density function of a GMM is a weighted sum of $M$ component densities:

$$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i \, g(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i),$$

where $\mathbf{x}$ is a $d$-dimensional feature vector, $w_i$, $i = 1, \ldots, M$, are the mixture weights whose summation is equal to 1, and $g(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)$, $i = 1, \ldots, M$, are the component densities in the form of the $d$-dimensional Gaussian function:

$$g(\mathbf{x}; \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) = (2\pi)^{-d/2} \, |\boldsymbol{\Sigma}_i|^{-1/2} \exp\!\left[ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) \right]$$

Moreover, the parameter $\lambda$ denotes the set of parameters of the GMM:

$$\lambda = \{ (w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \mid i = 1, \ldots, M \}$$
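
As a concrete illustration, the following minimal Python sketch evaluates $\log p(\mathbf{x} \mid \lambda)$ for a diagonal-covariance GMM. The function name gmm_log_pdf and the argument layout are chosen here for illustration only and are not part of the original text.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_log_pdf(x, weights, means, covs):
    """Log of p(x | lambda) for a diagonal-covariance GMM.

    x       : (d,) feature vector
    weights : (M,) mixture weights, summing to 1
    means   : (M, d) component means (mu_i)
    covs    : (M, d) diagonal entries of the covariances (Sigma_i)
    """
    d = x.shape[0]
    diff = x - means                                   # (M, d)
    # log g(x; mu_i, Sigma_i) for every component i
    log_g = (-0.5 * d * np.log(2 * np.pi)
             - 0.5 * np.sum(np.log(covs), axis=1)
             - 0.5 * np.sum(diff ** 2 / covs, axis=1))
    # log sum_i w_i * g(x; mu_i, Sigma_i), computed stably in the log domain
    return logsumexp(np.log(weights) + log_g)
```

Working in the log domain with logsumexp avoids the numerical underflow that a direct evaluation of the $M$ Gaussian densities would cause for high-dimensional features.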

For speaker identification, a group of $S$ speakers $\{1, 2, \ldots, S\}$ is represented by a set of GMMs denoted by $\lambda_1, \lambda_2, \ldots, \lambda_S$. The objective is to find the speaker model with the maximum a posteriori probability for a given observation sequence $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_t, \ldots, \mathbf{x}_T\}$ produced by a single speaker, where each $\mathbf{x}_t$ is the feature vector of the frame at index $t$. Under the assumption that the frames are independent, the predicted speaker $\hat{s}$ is given by:

$$\hat{s} = \arg\max_{k = 1, \ldots, S} \sum_{t=1}^{T} \log p(\mathbf{x}_t \mid \lambda_k)$$
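
In code, this decision rule amounts to summing the per-frame log-likelihoods under each speaker model and taking the arg max. The sketch below assumes the models are fitted scikit-learn GaussianMixture objects, whose score_samples method returns the per-frame log-likelihoods; the function name identify_speaker is hypothetical.

```python
import numpy as np

def identify_speaker(X, speaker_gmms):
    """Return the index k maximizing sum_t log p(x_t | lambda_k).

    X            : (T, d) array of frame-level feature vectors
    speaker_gmms : list of fitted sklearn GaussianMixture models,
                   one per enrolled speaker
    """
    # score_samples() gives log p(x_t | lambda_k) for every frame t
    total_log_likelihoods = [gmm.score_samples(X).sum() for gmm in speaker_gmms]
    return int(np.argmax(total_log_likelihoods))
```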
Based on the length of the observation sequence X, we can define different recognition rates:
  1. Frame-based recognition rate: X contains a single frame only. The frame-based recognition rate is usually the lowest, since a decision based on a single frame is less stable.
  2. Segment-based recognition rate: X contains a sequence of a fixed length. We usually employ the concept of a moving window to extract segments. For instance, if the frame step is 10 ms and an utterance contains 500 frames, then we can extract 301 two-second segments (each containing 200 frames) by using a moving window, as sketched after this list. Usually we vary the segment length to evaluate the recognition rate as a function of the segment length.
  3. Utterance-based recognition rate: X contains frames from a single utterance.
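
The moving-window computation mentioned in item 2 can be sketched as follows, again assuming fitted scikit-learn GaussianMixture models. With $T = 500$ frames and a segment length of 200 frames, the loop visits 500 - 200 + 1 = 301 segments, as in the example above; the function name and arguments are illustrative.

```python
import numpy as np

def segment_recognition_rate(X, true_speaker, speaker_gmms, seg_len=200):
    """Segment-based recognition rate for one utterance.

    X            : (T, d) frame-level features of a single utterance
    true_speaker : index of the speaker who produced X
    speaker_gmms : list of fitted GMMs, one per speaker
    seg_len      : segment length in frames (e.g. 200 frames ~ 2 s
                   when the frame step is 10 ms)
    """
    # Per-frame log-likelihoods under every speaker model: shape (S, T)
    frame_ll = np.stack([gmm.score_samples(X) for gmm in speaker_gmms])
    n_segments = X.shape[0] - seg_len + 1        # e.g. 500 - 200 + 1 = 301
    correct = 0
    for start in range(n_segments):
        # Sum the per-frame log-likelihoods inside the moving window
        seg_ll = frame_ll[:, start:start + seg_len].sum(axis=1)
        correct += int(np.argmax(seg_ll) == true_speaker)
    return correct / n_segments
```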
The steps for speaker identification can be summarized as follows:
  1. Feature extraction using the training set.
  2. GMM training to obtain a set of GMM parameters denoted by $\lambda_1, \lambda_2, \ldots, \lambda_S$.
  3. Recognition rate computation based on the test set. The resultant confusion matrix is S by S.
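
These steps might be wired together roughly as below, assuming frame-level features (e.g., MFCCs) have already been extracted into NumPy arrays. The data layout, the function name train_and_evaluate, and the choice of 32 diagonal-covariance mixtures are assumptions made here for illustration, not prescriptions from the text.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_and_evaluate(train_features, test_utterances, n_components=32):
    """Hypothetical end-to-end speaker identification.

    train_features  : dict {speaker_index: (N_k, d) array of training frames},
                      with speakers indexed 0 .. S-1
    test_utterances : list of (speaker_index, (T, d) array) test utterances
    Returns an S-by-S confusion matrix (rows: true, columns: predicted).
    """
    S = len(train_features)
    # Step 2: one GMM (lambda_k) per speaker, fitted on that speaker's frames
    gmms = [GaussianMixture(n_components=n_components,
                            covariance_type='diag').fit(train_features[k])
            for k in range(S)]
    # Step 3: utterance-based evaluation on the test set
    confusion = np.zeros((S, S), dtype=int)
    for true_k, X in test_utterances:
        scores = [gmm.score_samples(X).sum() for gmm in gmms]
        confusion[true_k, int(np.argmax(scores))] += 1
    return confusion
```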

On the other hand, for speaker verification, we need to train both a speaker model and an anti-speaker model for each speaker. The kth speaker model is trained on the training set from speaker k. The kth anti-speaker model is trained on data from all the other speakers except speaker k. We then compare the log probabilities under the speaker model and its anti-speaker model to determine whether to accept or reject the claimed identity.

The steps for speaker verification can be summarized as follows:

  1. Feature extraction using the training set.
  2. GMM training to obtain a set of speaker models denoted by $\lambda_1, \lambda_2, \ldots, \lambda_S$.
  3. GMM training to obtain a set of anti-speaker models denoted by $\tilde{\lambda}_1, \tilde{\lambda}_2, \ldots, \tilde{\lambda}_S$.
  4. Recognition rate computation based on the test set. For a given voice sample claimed to be from speaker $k$, we evaluate the log probabilities under the speaker model $\lambda_k$ and the anti-speaker model $\tilde{\lambda}_k$, and accept the claim if the difference in log probabilities is greater than a speaker-dependent threshold. The resultant confusion matrix is 2 by 2 (accept/reject versus genuine/impostor).
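
A minimal sketch of the accept/reject decision in step 4, assuming a fitted speaker model and anti-speaker model (scikit-learn GaussianMixture objects); averaging the log-likelihood difference over frames and the function name verify_claim are assumptions made here for illustration.

```python
import numpy as np

def verify_claim(X, speaker_gmm, anti_gmm, threshold=0.0):
    """Accept or reject a claimed identity via a log-likelihood ratio.

    X           : (T, d) features of the test utterance
    speaker_gmm : GMM trained on the claimed speaker's data (lambda_k)
    anti_gmm    : GMM trained on all other speakers' data (anti-model)
    threshold   : speaker-dependent decision threshold
    Returns True to accept the claim, False to reject it.
    """
    # Average per-frame difference of log-likelihoods under the two models
    llr = np.mean(speaker_gmm.score_samples(X) - anti_gmm.score_samples(X))
    return llr > threshold
```

Sweeping the threshold trades off false acceptances against false rejections, which is typically how the speaker-dependent threshold in step 4 is chosen.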
